Unreliable Networks

Learn about the details of an unreliable network, the subject of the most important fallacy of distributed computing: the assumption that the network is reliable.

Introduction#

A distributed system is a cluster of computer hosts connected by a network. Each host has its own storage, CPU, memory, and other operating system resources. This setup is called a shared-nothing architecture because hosts don't share their resources and communicate only over the network.

Shared-nothing architecture works well because it is easy to implement, reason about, and deploy on cheap commodity hardware with redundancy. A reliable communication network is the most crucial factor for a distributed system’s success.

This section will discuss some critical issues related to networks and host systems in a distributed system.

Client connectivity with servers

Network#

Here are some of the problems that a network might have:

  • Every request is broken down into smaller chunks called packets before being transmitted over the wire. Packets can be lost in transit, which triggers retries and increases perceived latency for the end user.

  • The network can be congested due to limited bandwidth, leading to slow responses.

  • TCP flow control regulates data traffic by preventing a fast sender from overwhelming a slow receiver. If the receiver is already processing at peak capacity, throughput drops.

  • The server might have processed the request, but the network dropped the response packet, or the client timed out before the response arrived. In either case, retrying the same request could result in duplicate processing.

  • The optical fiber connection might be damaged.
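The lost-response problem above is why retries usually pair with idempotency. Here is a minimal sketch, with a hypothetical in-memory `handle_request` standing in for the server: the client attaches an idempotency key to each logical request, so a retry after a lost response does not repeat the side effect.

```python
import uuid

# Hypothetical in-memory "server" state: idempotency key -> cached result.
processed = {}

def handle_request(key, amount):
    """Process a request at most once per idempotency key."""
    if key in processed:              # duplicate delivery: return cached result
        return processed[key]
    result = f"charged {amount}"      # the side effect happens only once
    processed[key] = result
    return result

key = str(uuid.uuid4())               # client generates one key per logical request
first = handle_request(key, 100)      # response is lost in the network...
retry = handle_request(key, 100)      # ...so the client retries the same request
assert first == retry and len(processed) == 1
```

The key point is that the retry is indistinguishable from the original on the wire, so the server, not the network, must deduplicate.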

Remote host#

Here are some of the problems that a remote host might have:

  • The remote host server might crash, and the client cannot tell whether the process crashed or is just responding slowly.

  • The remote host server might be processing at peak capacity and dropping or queueing new incoming requests.

  • The remote host server might pause for garbage collection, dropping or queueing new incoming requests in the meantime.

  • The disks used to store data have a lifespan beyond which failures are inevitable.

  • Multiple processes deployed on the remote host can contend for operating system resources like CPU, memory, and disk I/O, resulting in delayed processing.
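The crashed-versus-slow ambiguity can be sketched with a timeout. The helper names below (`call_with_timeout`, `slow_host`) are illustrative, not from the source; the point is that when the timeout fires, the client observes only silence and cannot tell which failure mode it hit.

```python
import queue
import threading
import time

def call_with_timeout(work, timeout):
    """Issue a 'remote' call and wait at most `timeout` seconds for a reply."""
    replies = queue.Queue()
    threading.Thread(target=lambda: replies.put(work()), daemon=True).start()
    try:
        return replies.get(timeout=timeout)
    except queue.Empty:
        # All the client sees is no reply: the process may have crashed,
        # may be paused for garbage collection, or may simply be slow.
        # The timeout alone cannot distinguish these cases.
        return None

def slow_host():                       # still alive, just responding slowly
    time.sleep(0.5)
    return "ok"

assert call_with_timeout(slow_host, timeout=0.1) is None   # looks "dead"
assert call_with_timeout(lambda: "ok", timeout=1.0) == "ok"
```

This is why practical systems treat a timeout as a suspicion, not proof of a crash, and combine it with retries and idempotent handling.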

Data center#

Here are some of the problems that a data center might have:

  • Data centers hosting remote hosts can be subject to disasters such as earthquakes or cyclones. Unless there is a backup data center, the result can be catastrophic.

  • A power supply outage to a data center can lead to the downtime of remote hosts in that data center.

  • A top-of-rack switch failure in a data center can take the hosts deployed in that rack offline.

Solutions#

Network failures, remote host failures, and data center failures are inevitable in a distributed system. Therefore, it is impossible to design a completely failure-resistant system. However, it is essential to create a distributed system that can tolerate faults, respond gracefully, and prevent faults from causing failures. These are some solutions we can implement:

  • At the network layer, TCP guarantees acknowledgments and ordered and reliable delivery of packets between two remote hosts. In addition, new enhancements in various network protocols ensure better reliability and throughput with reduced latency.

  • At the host layer, redundant hosts behind a load balancer prevent a loaded or crashed remote host from taking the service down. The load balancer periodically monitors the health of the remote hosts and removes slow or crashed hosts from the list of active ones.

  • The client and server should be designed to handle failures gracefully at the application layer. Timeout configuration, idempotency and duplicate-message handling, bulkheading, resource isolation, and load shedding are some techniques for building fault-tolerant systems.

  • At the data center layer, disks in the same host are configured in a redundant RAID array, and data is replicated across multiple hosts according to a replication factor. Replication is also spread across multiple racks to prevent a rack failure from causing downtime for a particular piece of data.

  • Dual power supplies and battery backups, availability zones, and multiregion replications are other strategies to prevent data center outages from causing an entire system failure.
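Of the application-layer techniques above, load shedding is the simplest to sketch. The snippet below is a toy illustration (the `MAX_PENDING` limit and `submit` function are assumptions, not a real API): a bounded queue in front of a worker rejects excess requests immediately instead of letting them pile up and raise latency for everyone.

```python
import queue

# Toy load-shedding sketch: a bounded queue in front of a worker.
MAX_PENDING = 2
pending = queue.Queue(maxsize=MAX_PENDING)

def submit(request):
    """Accept a request if capacity allows; otherwise shed it immediately."""
    try:
        pending.put_nowait(request)
        return "accepted"
    except queue.Full:
        # Fail fast so the client can back off and retry later,
        # rather than waiting on an ever-growing queue.
        return "rejected"

results = [submit(r) for r in range(4)]
assert results == ["accepted", "accepted", "rejected", "rejected"]
```

Rejecting early keeps the host responsive for the requests it has already accepted, which is usually preferable to timing out on all of them.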
